Christopher Barrie
Thursday, June 17: SICSS-Oxford
website: https://cjbarrie.xyz
github: https://github.com/cjbarrie
Twitter: https://www.twitter.com/cbarrie
library(tidyverse) # loads dplyr, ggplot2, and others
library(stringr) # to handle text elements
library(tidytext) # includes set of functions useful for manipulating text
library(topicmodels) # to estimate topic models
library(gutenbergr) # to get text data
library(scales) # to format plot scales and axis labels
library(tm) # for document-term matrix utilities such as inspect()
library(ggthemes) # to make your plots look nice
# Download both volumes of Tocqueville's "Democracy in America" (Gutenberg IDs 815 and 816)
tocq <- gutenberg_download(c(815, 816),
                           meta_fields = "author")
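tocq_words <- tocq %>%
  mutate(booknumber = ifelse(gutenberg_id == 815, "DiA1", "DiA2")) %>% # label each volume
  unnest_tokens(word, text) %>%            # one lowercased word per row
  count(booknumber, word, sort = TRUE) %>% # word counts within each volume
  ungroup() %>%
  anti_join(stop_words, by = "word")       # drop common stop words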
# Cast the tidy word counts into a document-term matrix
tocq_dtm <- tocq_words %>%
  cast_dtm(booknumber, word, n)
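
As a minimal sketch of what cast_dtm() does with tidy counts, the toy example below builds a document-term matrix from three invented rows. The `toy_counts` table and its values are made up for illustration and are not drawn from the Tocqueville texts.

```r
library(dplyr)
library(tidytext)
library(tm)

# Hypothetical tidy counts: one row per (document, word) pair
toy_counts <- tibble(
  booknumber = c("A", "A", "B"),
  word       = c("liberty", "equality", "liberty"),
  n          = c(3L, 1L, 2L)
)

# Rows become documents, columns become terms, cells hold the counts
toy_dtm <- toy_counts %>%
  cast_dtm(booknumber, word, n)

dim(toy_dtm)        # number of documents, number of terms
as.matrix(toy_dtm)  # dense view; absent (document, word) pairs are 0
```

The pair ("B", "equality") never appears in the input, so it becomes a zero cell. Zero cells of this kind are what the sparsity figure in a document-term matrix summary counts.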
tm::inspect(tocq_dtm)
<<DocumentTermMatrix (documents: 2, terms: 12092)>>
Non-/sparse entries: 17581/6603
Sparsity : 27%
Maximal term length: 18
Weighting : term frequency (tf)
Sample :
Terms
Docs country democratic government laws nations people power society time united
DiA1 357 212 556 397 233 516 543 290 311 554
DiA2 167 561 162 133 313 360 263 241 309 227
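
The sparsity figure in this summary can be checked by hand from the reported counts. The sketch below assumes sparsity is the share of zero cells in the matrix, rounded to a whole percent. The variable names are illustrative only.

```r
# Values taken from the inspect() summary above
n_docs        <- 2
n_terms       <- 12092
nonzero_cells <- 17581

n_cells    <- n_docs * n_terms         # total cells in the matrix
zero_cells <- n_cells - nonzero_cells  # should match the reported sparse entries

sparsity_pct <- round(100 * zero_cells / n_cells)
zero_cells
sparsity_pct
```

With only two long documents, most words occur in both volumes, which is why the matrix is far less sparse than is typical for corpora made up of many short documents.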
tocq_lda <- LDA(tocq_dtm, k = 10, control = list(seed = 1234))
tocq_topics <- tidy(tocq_lda, matrix = "beta")
tocq_topics %>%
  arrange(-beta)
# A tibble: 120,920 x 3
topic term beta
<int> <chr> <dbl>
1 4 democratic 0.0193
2 5 people 0.0162
3 6 people 0.0154
4 10 people 0.0130
5 7 government 0.0127
6 3 united 0.0124
7 7 power 0.0121
8 6 power 0.0121
9 4 people 0.0120
10 2 democratic 0.0115
# … with 120,910 more rows
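
A common next step with a tidy beta table is to pull out the highest-probability terms within each topic. The sketch below shows one way to do that on a small invented table with the same three columns (topic, term, beta). The `toy_topics` values are made up for illustration, and the same pipeline could plausibly be applied to `tocq_topics`.

```r
library(dplyr)

# Hypothetical beta table: per-topic word probabilities (each topic sums to 1)
toy_topics <- tibble(
  topic = c(1L, 1L, 1L, 2L, 2L, 2L),
  term  = c("people", "power", "laws", "people", "power", "laws"),
  beta  = c(0.5, 0.3, 0.2, 0.1, 0.2, 0.7)
)

# Keep the two most probable terms in each topic
toy_top_terms <- toy_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 2) %>%
  ungroup() %>%
  arrange(topic, -beta)

toy_top_terms
```

Grouping by topic before ranking matters here. A global sort by beta, as in the output above, mixes topics together and tends to be dominated by a handful of very frequent words.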
tocq_gamma <- tidy(tocq_lda, matrix = "gamma")
head(tocq_gamma, n = 10)
# A tibble: 10 x 3
document topic gamma
<chr> <int> <dbl>
1 DiA2 1 0.00504
2 DiA1 1 0.0217
3 DiA2 2 0.198
4 DiA1 2 0.00000152
5 DiA2 3 0.00000213
6 DiA1 3 0.197
7 DiA2 4 0.216
8 DiA1 4 0.00000152
9 DiA2 5 0.180
10 DiA1 5 0.00000152
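
To compare the two volumes topic by topic, the gamma table can be reshaped so that each document gets its own column. The sketch below retypes the ten rows printed above (topics 1 to 5 only) and uses tidyr's pivot_wider(). The `gamma_head` and `gamma_wide` names are illustrative.

```r
library(dplyr)
library(tidyr)

# The ten rows shown in the head() output above
gamma_head <- tibble(
  document = rep(c("DiA2", "DiA1"), times = 5),
  topic    = rep(1:5, each = 2),
  gamma    = c(0.00504, 0.0217,
               0.198, 0.00000152,
               0.00000213, 0.197,
               0.216, 0.00000152,
               0.180, 0.00000152)
)

# One row per topic, one column per document
gamma_wide <- gamma_head %>%
  pivot_wider(names_from = document, values_from = gamma)

gamma_wide
```

In this view, topics 2, 4 and 5 load almost entirely on DiA2 while topic 3 loads on DiA1. Since only five of the ten topics are included, the columns do not sum to 1 here.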